Research Question(s):
1. What is the best way to classify pitches by pitch type and what are the most influential variables in determining pitch type?
2. Is there a reliable way to use these findings to classify/predict pitches in real time?
Data Collection
All of our data was collected with the BaseballSavant search tool, filtered to every pitch thrown in MLB in May 2019. The unit of observation is an individual pitch, and each variable measures a characteristic of that pitch as recorded by the TrackMan devices installed at each MLB stadium. We chose May because we hypothesized that this month would have the highest percentage of teams’ “crucial pitchers” healthy and active, since the season is just getting underway. We use only one month of data because the full dataset is extremely large and our computers lack the computing power to handle more.
Plan for Modeling
In this study, we attempt to predict the pitch type of each pitch thrown in MLB during May 2019 using a variety of techniques. We begin by clustering pitches on the variables that are visible to the naked eye, to show why we need models that can process more variables when determining pitch type. We then examine exploratory plots that show different elements of the dataset. Next, we split the dataset by pitcher handedness and divide each subset (one for Left-Handed Pitchers and one for Right-Handed Pitchers) into training data (70 % of observations) and test data (30 % of observations). We then attempt to predict pitch types using Decision Trees, K-Nearest Neighbor Modeling, and Random Forest Modeling, and apply the models to the test data to determine which one is the most accurate and suitable for our goal. Lastly, we apply the best model to real pitches to simulate its real-time predictive capabilities.
Data Cleaning
In the data cleaning portion, we select 17 variables to keep: pitch_type, pitch_name, release_speed, release_pos_x, release_pos_z, release_pos_y, player_name, p_throws, pfx_x, pfx_z, release_spin_rate, release_extension, home_score, away_score, plate_x, plate_z, and stand. These variables are useful in distinguishing between pitch types because they determine the behavior of a given pitch. We then made spin rate numeric and removed several groups of pitches: those classified as an Eephus or Knuckleball, because they are scarcely thrown and were likely to be outliers; those released from heights greater than 8 feet, because such a reading is implausible (all of these pitches were thrown by 6’1" pitcher Dylan Covey); and those thrown slower than 60 MPH. We also dropped pitches delivered by position players because they were almost certain to be of a lower caliber than those thrown by actual pitchers, making them likely outliers. Next, we omitted observations with missing values and merged the similar pitch types Knuckle-Curve and Sinker into the Curveball and 2-Seam Fastball categories, respectively. Finally, we split the cleaned dataset into four datasets, a training set and a test set for lefties and for righties. The split by handedness accounts for the fact that movement values depend on pitcher handedness, which changes the direction in which each pitch type moves.
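The cleaning and splitting steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: rows are assumed to be dictionaries keyed by the column names listed above, the pitch-name strings and the "null" missing-value marker are guesses about the export, and `position_players`, `clean`, and `split_by_hand` are hypothetical names introduced here.

```python
import random

# Assumed label strings; the exact values in the export may differ.
DROPPED_PITCHES = {"Eephus", "Knuckleball"}
MERGED_PITCHES = {"Knuckle Curve": "Curveball", "Sinker": "2-Seam Fastball"}
REQUIRED = ("pitch_name", "p_throws", "release_speed", "release_pos_z",
            "release_spin_rate", "pfx_x", "pfx_z")


def clean(rows, position_players=frozenset()):
    """Apply the filters described above to a list of dict rows."""
    kept = []
    for row in rows:
        # Omit observations with missing values in the fields we rely on.
        if any(row.get(col) in (None, "", "null") for col in REQUIRED):
            continue
        # Drop rarely thrown pitch types and pitches from position players.
        if row["pitch_name"] in DROPPED_PITCHES:
            continue
        if row.get("player_name") in position_players:
            continue
        # Drop implausible release heights (> 8 ft) and very slow pitches (< 60 MPH).
        if float(row["release_pos_z"]) > 8.0 or float(row["release_speed"]) < 60.0:
            continue
        row = dict(row)  # copy so the caller's data is left untouched
        row["release_spin_rate"] = float(row["release_spin_rate"])  # make spin rate numeric
        # Merge similar pitch types into a single category.
        row["pitch_name"] = MERGED_PITCHES.get(row["pitch_name"], row["pitch_name"])
        kept.append(row)
    return kept


def split_by_hand(rows, train_frac=0.7, seed=0):
    """Return {hand: (train, test)} with a shuffled 70/30 split per handedness."""
    rng = random.Random(seed)
    out = {}
    for hand in ("L", "R"):
        subset = [r for r in rows if r["p_throws"] == hand]
        rng.shuffle(subset)
        cut = int(round(train_frac * len(subset)))
        out[hand] = (subset[:cut], subset[cut:])
    return out
```

Calling `split_by_hand(clean(rows))` yields the four datasets described above: a training and a test set for each handedness.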
Motivation
As can be seen in the GIFs below, it can be extremely difficult to distinguish between pitch types with the naked eye, even for experts. The videos linked below show each pitch type thrown by Mike Leake of the Arizona Diamondbacks. For consistency, they hold the venue (all pitches are from the Diamondbacks’ home stadium, Chase Field), camera angle, and batter handedness fixed, to demonstrate just how daunting a task it can be to differentiate.
Goals of this Research
Some pitchers’ pitch types are easier to distinguish than others without knowledge of the values of crucial variables like velocity, spin rate, and movement. Even with this information at their disposal, many commentators and fans at games are left wondering what a pitch actually was after seeing it, a problem our project seeks to resolve. What are the main distinguishing metrics between pitch types? What are the characteristics of each pitch type? Our analysis below seeks to answer these questions.
Exploratory Plots
Dataset Breakdowns
Pitcher Handedness

The bar plot above shows the breakdown of pitches thrown in May 2019 by pitcher handedness. It makes clear that far more pitches were delivered by righties than by lefties.
Pitch Type Distribution

The plot above shows the distribution of each pitch type within the clean May 2019 dataset. Unsurprisingly, 4-Seam Fastballs were the most commonly thrown pitch by a wide margin, with Sliders, 2-Seam Fastballs, Changeups, and Curveballs trailing behind. Cutters and Split-Finger Fastballs were the least commonly thrown pitches within the data.
Clustering


Top Three Pitches Within Each Cluster for LHP by %:
Cluster 1:
82.22 % 4-Seam Fastball
9.23 % 2-Seam Fastball
4.56 % Changeup
Cluster 2:
56.66 % Slider
30.69 % Cutter
6.50 % Curveball
Cluster 3:
73.51 % Curveball
26.14 % Slider
0.25 % Cutter
Cluster 4:
49.44 % 2-Seam Fastball
39.15 % Changeup
11.29 % 4-Seam Fastball
Top Three Pitches Within Each Cluster for RHP by %:
Cluster 1:
51.84 % 2-Seam Fastball
33.47 % Changeup
9.35 % 4-Seam Fastball
Cluster 2:
74.42 % Curveball
24.79 % Slider
0.73 % Cutter
Cluster 3:
89.91 % 4-Seam Fastball
4.91 % 2-Seam Fastball
3.08 % Changeup
Cluster 4:
66.75 % Slider
23.32 % Cutter
4.35 % Curveball
The clusters above are created using only the vertical and horizontal movement of each pitch. As the percentages show, movement alone is not a reliable way to predict pitch type: many of the clusters are quite heterogeneous, meaning that several different pitch types were grouped together. This clustering process simulates what the naked eye does when watching a game. As a result, we concluded that we needed to add other predictors, such as spin rate and velocity, to create a more accurate model that can actually be useful for broadcasters or fans.
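The cluster breakdowns above can be reproduced in outline with the sketch below. The source does not name the clustering algorithm, so k-means (Lloyd's algorithm) with four clusters is an assumption here, as are the function names `kmeans` and `cluster_composition`; each point is assumed to be a `(pfx_x, pfx_z)` pair.

```python
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


def cluster_composition(labels, pitch_names, top=3):
    """Percent share of the most common pitch types inside each cluster."""
    result = {}
    for c in sorted(set(labels)):
        counts = Counter(n for n, lab in zip(pitch_names, labels) if lab == c)
        total = sum(counts.values())
        result[c] = [(name, round(100 * n / total, 2))
                     for name, n in counts.most_common(top)]
    return result
```

A cluster dominated by one pitch type (such as the 4-Seam Fastball cluster) shows up as a single large share, while a heterogeneous cluster shows two or three comparable shares.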
Examining Variables of Interest
Average Extension in May 2019 Dataset by Pitch Type
| Pitch Type | Average Extension (in) |
| FF | 54.34629 |
| FT | 54.42841 |
| CH | 54.43469 |
| FC | 54.61323 |
| FS | 54.62481 |
| SL | 54.73945 |
| CU | 54.83600 |
The table above shows that average extension is essentially the same for every pitch type: the averages span less than half a unit (54.35 to 54.84), so extension will not serve as a meaningful predictor in our models.

The plot above shows the release position, from the home plate view, of each pitch thrown in MLB in May 2019. The pitch types do not differ enough in release position for it to predict pitch type, at least not without many other predictors. Because mechanics vary from pitcher to pitcher, nearly every pitch type is thrown from every potential arm angle.

The plot above shows the spin rate and velocity for all types of fastballs (Cutter, 4-Seam, 2-Seam). Cutters tend to have higher spin and lower velocity than the other two kinds, 2-Seam Fastballs tend to have lower spin and the widest range of velocity, and 4-Seam Fastballs typically have the highest velocity.

The plot above shows the velocity range of each pitch type. The substantial differences between pitch types suggest that velocity is likely to be a useful predictor in our models.
Modeling Using Training Data
Small Tree
Results of the Small Tree for LHP
| cp | Accuracy | Kappa | AccuracySD | KappaSD |
| 0.005 | 0.813857 | 0.7648121 | 0.0098596 | 0.0120441 |
Results of the Small Tree for RHP
| cp | Accuracy | Kappa | AccuracySD | KappaSD |
| 0.005 | 0.8179964 | 0.7639932 | 0.0039193 | 0.0047684 |
Small Tree Visualization for LHP

Small Tree Visualization for RHP

These basic decision trees use every predictor in our training dataset except pitch name, player name, batter handedness, home score, and away score. We chose a complexity parameter that does not yield the highest accuracy, sacrificing some accuracy for the superior interpretability of the resulting tree. We can use these trees to better understand how the models predict which pitch is being thrown. The accuracies are listed above.
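For readers unfamiliar with the Kappa column in the results tables, the sketch below shows how accuracy and Cohen's kappa are computed from actual and predicted labels. It assumes the reported Kappa is Cohen's kappa, which the column names suggest but the source does not state, and the function name `accuracy_and_kappa` is introduced here for illustration.

```python
from collections import Counter


def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    n = len(actual)
    # Observed agreement: the share of pitches labeled correctly.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Expected agreement if predictions were independent of the true labels,
    # given how often each label appears in each sequence.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    if expected == 1:
        return observed, float("nan")  # kappa is undefined with a single class
    return observed, (observed - expected) / (1 - expected)
```

Kappa matters here because the classes are unbalanced: a model that always guessed 4-Seam Fastball would earn a respectable accuracy but a kappa of zero.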
Large Tree


Based on the graphs above of accuracy against complexity parameter, we chose a cp of 4e-05 for both models (to maintain consistency) in an effort to avoid overfitting the models to the training data. For both LHP and RHP, accuracy jumps at complexity parameters slightly greater than 4e-05, so we use a cp just below these jumps to ensure that the training data doesn’t sway the model too much.
Results of the Large Tree for LHP
| cp | Accuracy | Kappa | AccuracySD | KappaSD |
| 4e-05 | 0.8862902 | 0.8562046 | 0.0016356 | 0.0019782 |
Results of the Large Tree for RHP
| cp | Accuracy | Kappa | AccuracySD | KappaSD |
| 4e-05 | 0.8740224 | 0.8376267 | 0.0027736 | 0.0035247 |
These decision trees use the same inputs as the smaller trees, but with the complexity parameter chosen from the cross-validated accuracy results rather than for interpretability. They are better suited to actual prediction than the smaller trees, but the trees themselves are much too busy to display. The accuracies are listed above.
KNN Model


When deciding the optimal number of neighbors to use in our models, we prioritized high accuracy, while minimizing the total number of neighbors used, to maintain the simplicity of the model. As a result, our final models use 10 neighbors.
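The idea behind the KNN models can be sketched as below: a pitch is assigned the majority label among its k nearest training pitches. This is an illustrative sketch, not the fitted model. Standardizing the predictors is an assumption (the source does not say whether the inputs were scaled), and `standardize` and `knn_predict` are hypothetical names.

```python
import math
from collections import Counter


def standardize(train, rows):
    """Scale each column of `rows` using the training mean and standard deviation."""
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def knn_predict(train_x, train_y, query, k=10):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Scaling matters for a distance-based method: without it, spin rate (in the thousands) would dominate velocity (in the tens) when measuring how close two pitches are.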
Results of KNN Model for LHP
| k | Accuracy | Kappa | AccuracySD | KappaSD |
| 10 | 0.8662288 | 0.8305736 | 0.0069872 | 0.0088534 |
Results of KNN Model for RHP
| k | Accuracy | Kappa | AccuracySD | KappaSD |
| 10 | 0.8569143 | 0.81519 | 0.0039993 | 0.0051095 |
Random Forest
Because a random forest model takes a long time to fit, we only include our final model (mtry = 6) in this code. Of the values we tried, this mtry gave the best combination of accuracy and reduced risk of overfitting.
Results of Random Forest Model for LHP
| Accuracy | Kappa | mtry |
| 0.9297412 | 0.911039 | 6 |
Results of Random Forest Model for RHP
| Accuracy | Kappa | mtry |
| 0.9146547 | 0.889921 | 6 |
After running each of the models on the training data for both LHP and RHP, it appears that the random forest model is the most accurate one for predicting pitch types. However, to ensure that the models weren’t overfit to the training datasets, we need to run each of them on the testing datasets as well.
Testing the Models
Testing Results of Small Tree for LHP
Test Data Accuracy for Small Decision Tree Model for LHP
| PCT |
| 0.8201968 |
Testing Results of Small Tree for RHP
Test Data Accuracy for Small Decision Tree Model for RHP
| PCT |
| 0.814757 |
Testing Results of Large Tree for LHP
Test Data Accuracy for Decision Tree Model for LHP
| PCT |
| 0.897566 |
Testing Results of Large Tree for RHP
Test Data Accuracy for Decision Tree Model for RHP
| PCT |
| 0.8775331 |
Testing Results of KNN for LHP
Test Data Accuracy for KNN Model for LHP
| PCT |
| 0.8705334 |
Testing Results of KNN for RHP
Test Data Accuracy for KNN Model for RHP
| PCT |
| 0.8594774 |
Testing Results of Random Forest for LHP
Test Data Accuracy for Random Forest Model for LHP
| PCT |
| 0.9346453 |
Testing Results of Random Forest for RHP
Test Data Accuracy for Random Forest Model for RHP
| PCT |
| 0.9144065 |
As you can see from this test data, the random forest is clearly the most accurate model for predicting pitch types because it has both the best training and testing accuracy. As a result, we would use this as our go-to model for making predictions.
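The comparison can be made explicit by placing the training-stage and test accuracies side by side. The sketch below only rearranges the figures reported in the tables above; the names `results`, `summarize`, and `best_model` are introduced here for illustration.

```python
# (training-stage accuracy, test accuracy) as reported in the tables above.
results = {
    ("Small Tree", "LHP"): (0.813857, 0.8201968),
    ("Small Tree", "RHP"): (0.8179964, 0.814757),
    ("Large Tree", "LHP"): (0.8862902, 0.897566),
    ("Large Tree", "RHP"): (0.8740224, 0.8775331),
    ("KNN", "LHP"): (0.8662288, 0.8705334),
    ("KNN", "RHP"): (0.8569143, 0.8594774),
    ("Random Forest", "LHP"): (0.9297412, 0.9346453),
    ("Random Forest", "RHP"): (0.9146547, 0.9144065),
}


def summarize(results):
    """Rows of (model, hand, train_acc, test_acc, gap), best test accuracy first."""
    rows = [(model, hand, train_acc, test_acc, round(test_acc - train_acc, 4))
            for (model, hand), (train_acc, test_acc) in results.items()]
    return sorted(rows, key=lambda r: r[3], reverse=True)


def best_model(results, hand):
    """Name of the model with the highest test accuracy for one handedness."""
    candidates = {m: test for (m, h), (_, test) in results.items() if h == hand}
    return max(candidates, key=candidates.get)


for row in summarize(results):
    print(row)
```

A test accuracy close to the training-stage accuracy, as every model shows here, is the pattern expected when a model has not been badly overfit.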
Models in Real-Time Action
Applying Our Best Model to Specific Pitches
Below we have randomly selected a handful of pitches from within the dataset. For each one, we pair the GIF of the pitch with the variables of interest from the modeling and apply our best random forest model to predict the pitch type, as it would if we were using it to guess pitch types in real time. The results are below:
Jordan Hicks 2-Seam Fastball
Video
| Pitcher | Pitch Type | Prediction | Velocity | Spin Rate | Horizontal Movement | Vertical Movement |
| Jordan Hicks | FT | FT | 103.7 | 2143 | -1.2426 | 0.9238 |
Rich Hill Curveball
Video
| Pitcher | Pitch Type | Prediction | Velocity | Spin Rate | Horizontal Movement | Vertical Movement |
| Rich Hill | CU | CU | 73.6 | 2863 | -1.4713 | -1.1273 |
Luis Castillo Changeup
Video
| Pitcher | Pitch Type | Prediction | Velocity | Spin Rate | Horizontal Movement | Vertical Movement |
| Luis Castillo | CH | CH | 87.3 | 1875 | -1.3342 | 0.2198 |
Chaz Roe Slider
Video
| Pitcher | Pitch Type | Prediction | Velocity | Spin Rate | Horizontal Movement | Vertical Movement |
| Chaz Roe | SL | SL | 79.3 | 3033 | 2.1146 | -0.6335 |
Oliver Drake Splitter
Video
| Pitcher | Pitch Type | Prediction | Velocity | Spin Rate | Horizontal Movement | Vertical Movement |
| Oliver Drake | FS | FS | 84.1 | 697 | -0.8211 | 0.6051 |
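Because separate models are fit for left- and right-handed pitchers, a real-time lookup like the ones above first has to route each pitch to the model for that pitcher's handedness. The sketch below illustrates only that routing step. `FEATURES`, `predict_pitch`, and `stub_rhp_model` are hypothetical, the feature list is a reduced one chosen for brevity, and the stub stands in for a fitted random forest rather than reproducing it.

```python
# Hypothetical, reduced feature list; the actual models use more predictors.
FEATURES = ("release_speed", "release_spin_rate", "pfx_x", "pfx_z")


def predict_pitch(pitch, models):
    """Route one tracked pitch to the model trained for that pitcher's handedness."""
    hand = pitch["p_throws"]
    if hand not in models:
        raise ValueError(f"no model for handedness {hand!r}")
    features = [pitch[name] for name in FEATURES]
    return models[hand](features)


def stub_rhp_model(features):
    """Placeholder that always answers FT; a fitted model would replace it."""
    return "FT"


# Values taken from the Jordan Hicks row above; Hicks throws right-handed.
hicks = {"p_throws": "R", "release_speed": 103.7, "release_spin_rate": 2143,
         "pfx_x": -1.2426, "pfx_z": 0.9238}
models = {"R": stub_rhp_model}
print(predict_pitch(hicks, models))
```

Keeping the models in a dictionary keyed by handedness means the calling code does not change when the stub is swapped for a real classifier.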
Conclusion
In conclusion, we determined that a Random Forest model, fit separately to LHP and RHP data, is the most effective of the methods we tested for predicting pitch types in real time, given the right data. It could be a very useful tool for broadcasters and fans alike at MLB ballparks. It could be especially valuable at lower levels of baseball, such as the Minor Leagues (MiLB) or the collegiate level, where each pitcher’s pitch types may be harder to decipher due to inferior velocity and the lack of prior knowledge about lower-level pitchers. If we were given more time to pursue this project, we would create an app that could be used in real time during games.